feat: implement sync mode #2961
Conversation
Claude finished @tac0turtle's task in 2m 33s - View job

Code Review Complete

I've performed a comprehensive code review of the sync mode implementation. This is a well-architected feature that adds intelligent dual-mode synchronization. Below is my detailed analysis.

Overview Assessment

This PR introduces a sophisticated sync mode system that allows nodes to dynamically switch between:

- Catchup mode: aggressive polling while the node is behind the DA head
- Follow mode: subscription-based syncing once the node reaches the head of the chain
The implementation is generally solid with good architectural decisions. However, I've identified several critical issues and areas for improvement.

Critical Issues

1. Potential Data Race in processSubscriptionResponse (syncer.go:566-579)

The loop iterates over events and takes the address of the loop variable:

```go
for _, event := range events {
	select {
	case s.heightInCh <- event:
		// ...
	default:
		s.cache.SetPendingEvent(event.Header.Height(), &event) // ⚠️ Taking address of loop variable
```

Issue: While Go 1.22+ fixed the loop variable capture issue for the variable itself, taking the address of a loop variable still requires care: on Go versions before 1.22, each iteration reuses the same memory location for `event`.

Impact: HIGH - Multiple cached events will incorrectly reference the same (last) event, causing data loss and sync failures.

Fix: Add explicit capture:

```go
for _, event := range events {
	event := event // Explicit capture
	select {
	case s.heightInCh <- event:
		// ...
	default:
		s.cache.SetPendingEvent(event.Header.Height(), &event)
```

Location: block/internal/syncing/syncer.go:566-579

2. Missing Timeout on Subscription Context (syncer.go:452-543)

The subscription context is created without any timeout:

```go
subCtx, cancel := context.WithCancel(s.ctx)
defer cancel()
headerCh, err := s.daClient.Subscribe(subCtx, headerNS)
```

Issue: If the subscription hangs during creation or the watchdog never fires (e.g., one namespace keeps sending events), the goroutine could be stuck indefinitely.

Impact: MEDIUM - Potential goroutine leak and inability to switch back to catchup mode.

Recommendation: Consider adding a maximum subscription duration or ensure the watchdog is robust enough to detect all stuck scenarios.
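One hedged option: cap the lifetime of a single subscription session with context.WithTimeout and let the mode check resubscribe afterwards. maxSubscriptionDuration is a hypothetical constant, not in the PR:

```go
// Sketch only: bound a single subscription session; when the timeout
// fires, subCtx is cancelled, the channels close, and the caller
// re-enters the mode check, resubscribing if still in follow mode.
const maxSubscriptionDuration = 10 * time.Minute // hypothetical tuning value

subCtx, cancel := context.WithTimeout(s.ctx, maxSubscriptionDuration)
defer cancel()

headerCh, err := s.daClient.Subscribe(subCtx, headerNS)
if err != nil {
	return fmt.Errorf("failed to subscribe to header namespace: %w", err)
}
```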
3. Race Condition in Mode Determination (syncer.go:363-374)

Mode switches are tracked, but there's a potential race between determining the mode and storing it:

```go
mode := s.determineSyncMode()
previousMode := SyncMode(s.currentSyncMode.Load())
if mode != previousMode {
	s.currentSyncMode.Store(int32(mode))
	s.metrics.ModeSwitches.Add(1)
```

Issue: Between determining the mode and storing it, another goroutine could change the mode, leading to incorrect metrics and logging.

Impact: LOW - Metrics and logs may be slightly inaccurate, but won't affect correctness of sync operation.

Recommendation: Consider compare-and-swap for mode updates or accept the minor race (acceptable for metrics).
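A minimal sketch of the compare-and-swap variant, assuming currentSyncMode is an atomic.Int32 (as the Load/Store calls suggest) and that SyncMode has a String method:

```go
// Only the goroutine that wins the CAS records the transition, so the
// metric and the log line cannot double-count a single switch.
mode := s.determineSyncMode()
previous := s.currentSyncMode.Load()
if int32(mode) != previous && s.currentSyncMode.CompareAndSwap(previous, int32(mode)) {
	s.metrics.ModeSwitches.Add(1)
	s.logger.Info().
		Str("from", SyncMode(previous).String()).
		Str("to", mode.String()).
		Msg("sync mode switched")
}
```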
High-Priority Issues

4. Insufficient Error Context in subscribeAndFollow (syncer.go:440-447)

Subscription failures increment a metric but don't provide detailed context for debugging:

```go
err := s.subscribeAndFollow()
if err != nil && !errors.Is(err, context.Canceled) {
	s.metrics.SubscribeErrors.Add(1)
	s.logger.Warn().Err(err).Msg("subscribe failed, will retry via mode check")
```

Issue: No context about which namespace failed, what the network state was, or how many times this has failed in a row.

Recommendation: Add structured logging fields (namespace, DA height, consecutive failures) to aid debugging production issues.
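For illustration, a hedged sketch with zerolog-style fields; headerNS and the failure counter are assumptions about surrounding state:

```go
err := s.subscribeAndFollow()
if err != nil && !errors.Is(err, context.Canceled) {
	consecutiveFailures++ // hypothetical counter owned by this loop
	s.metrics.SubscribeErrors.Add(1)
	s.logger.Warn().
		Err(err).
		Str("namespace", string(headerNS)). // assumed to be in scope
		Int("consecutive_failures", consecutiveFailures).
		Msg("subscribe failed, will retry via mode check")
}
```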
5. LocalHead Error Handling Could Be More Robust (syncer.go:386-406)

When LocalHead fails, the syncer quietly falls back to catchup mode:

```go
localHead, err := s.daClient.LocalHead(s.ctx)
if err != nil {
	s.logger.Debug().Err(err).Msg("failed to get local DA head, defaulting to catchup mode")
	return SyncModeCatchup
}
```

Issue: This masks potentially serious DA client issues. If LocalHead consistently fails, the node will never enter follow mode.

Recommendation: Track consecutive LocalHead failures and escalate the log level (e.g., to Warn) when they persist, so operators can see why follow mode is never reached.
6. ProcessBlobs Method Unexported but Called via Type Assertion (syncer.go:563)

```go
events := s.daRetriever.ProcessBlobs(s.ctx, blobs, resp.Height)
```

Current Status: ✅ This is actually correct.

Medium-Priority Issues

7. Test Coverage is Low (Codecov Report)

Codecov shows 36.25% patch coverage with 109 lines missing.

Issue: Critical paths like subscription handling, mode switching, and error scenarios lack test coverage.

Recommendation: Add tests for mode switching, subscription handling (including closed channels), and error paths; a starting point is sketched below.
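As one starting point, the mode-selection rule can be unit-tested in isolation. This sketch extracts it into a local helper (hypothetical; the PR computes this inside determineSyncMode, and the SyncModeFollow name is assumed):

```go
func TestModeSelection(t *testing.T) {
	// decide mirrors the catchup/follow rule: fall back to catchup when
	// more than catchupThreshold blocks behind the local DA head.
	decide := func(localHead, processed uint64) SyncMode {
		if localHead > processed && localHead-processed > catchupThreshold {
			return SyncModeCatchup
		}
		return SyncModeFollow
	}

	cases := []struct {
		name                 string
		localHead, processed uint64
		want                 SyncMode
	}{
		{"far behind", 100, 10, SyncModeCatchup},
		{"at head", 100, 100, SyncModeFollow},
		{"within threshold", 100, 99, SyncModeFollow},
	}
	for _, tc := range cases {
		if got := decide(tc.localHead, tc.processed); got != tc.want {
			t.Errorf("%s: got %v, want %v", tc.name, got, tc.want)
		}
	}
}
```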
8. Watchdog Timeout Calculation (syncer.go:493-496)

```go
watchdogTimeout := s.config.DA.BlockTime.Duration * followWatchdogMultiplier
if watchdogTimeout <= 0 {
	watchdogTimeout = 30 * time.Second
}
```

Issue: If the configured DA block time is tiny but positive, the computed timeout can be unreasonably short while still passing the <= 0 guard.

Recommendation: Enforce a minimum watchdog timeout (e.g., 5 seconds) regardless of block time.
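A sketch of the suggested clamp; the 5-second floor is the review's example value, not a tuned constant:

```go
watchdogTimeout := s.config.DA.BlockTime.Duration * followWatchdogMultiplier
const minWatchdogTimeout = 5 * time.Second // floor, per the recommendation above
if watchdogTimeout < minWatchdogTimeout {
	watchdogTimeout = minWatchdogTimeout // also covers zero and negative values
}
```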
9. Missing Cancellation on Subscription Errors (syncer.go:462-474)

If one subscription succeeds but another fails, the successful subscription's context isn't explicitly cancelled in the error path:

```go
headerCh, err := s.daClient.Subscribe(subCtx, headerNS)
if err != nil {
	return fmt.Errorf("failed to subscribe to header namespace: %w", err)
}
// If this fails, headerCh's subscription is still active via subCtx
dataCh, err := s.daClient.Subscribe(subCtx, dataNS)
if err != nil {
	return fmt.Errorf("failed to subscribe to data namespace: %w", err)
}
```

Current Status: Actually OK - the deferred cancel() tears down the header subscription when the function returns with an error.

Recommendation: Add a comment explaining that defer handles cleanup.
10. LocalDA RPC Missing Implementation (tools/local-da/rpc.go:133-137)

The LocalDA test harness returns a closed channel for Subscribe:

```go
func (s *blobServer) Subscribe(_ context.Context, _ libshare.Namespace) (<-chan *jsonrpc.SubscriptionResponse, error) {
	ch := make(chan *jsonrpc.SubscriptionResponse)
	close(ch)
	return ch, nil
}
```

Issue: This will cause follow mode tests to immediately fail with a "subscription closed" error.

Recommendation: Implement a basic subscription mechanism for testing, even if it's just a timer-based mock.
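A hedged sketch of such a timer-based mock, emitting one response per simulated block so follow-mode tests receive a live channel. The s.blockTime field and the Height field on the response are assumptions about the harness types:

```go
func (s *blobServer) Subscribe(ctx context.Context, _ libshare.Namespace) (<-chan *jsonrpc.SubscriptionResponse, error) {
	ch := make(chan *jsonrpc.SubscriptionResponse)
	go func() {
		defer close(ch)
		ticker := time.NewTicker(s.blockTime) // assumed harness field
		defer ticker.Stop()
		var height uint64
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				height++
				// Guard the send with ctx so a slow consumer cannot leak this goroutine.
				select {
				case ch <- &jsonrpc.SubscriptionResponse{Height: height}:
				case <-ctx.Done():
					return
				}
			}
		}
	}()
	return ch, nil
}
```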
Code Quality Issues

11. Magic Numbers

Several values are hard-coded, e.g. catchupThreshold = 2, followWatchdogMultiplier, and the 30-second watchdog fallback.

Recommendation: Add comments explaining the rationale, or make these configurable.

12. Redundant Nil Check (syncer.go:513)

```go
case resp, ok := <-dataCh:
	// Note: if dataCh is nil (same namespace as header), this case never fires
	if !ok {
		return errors.New("data subscription closed")
```

The comment correctly notes that selecting on a nil channel blocks forever, making the nil check conceptually redundant. However, the check is for channel closure, not nil, so this is actually correct.

Status: ✅ Code is correct as-is.
Security Considerations

13. No Rate Limiting on Subscription Events

The subscription processing loop has no rate limiting:

```go
case resp, ok := <-headerCh:
	if !ok {
		return errors.New("header subscription closed")
	}
	if err := s.processSubscriptionResponse(resp); err != nil {
		s.logger.Error().Err(err).Uint64("height", resp.Height).Msg("failed to process header subscription")
	}
```

Issue: A malicious or buggy DA node could flood subscriptions with events, overwhelming the syncer.

Recommendation: Bound how fast subscription events are consumed, for example with a token-bucket limiter or explicit backpressure; one option is sketched below.
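A hedged sketch using the token-bucket limiter from golang.org/x/time/rate; the limits are illustrative, not tuned:

```go
// Illustrative: allow a sustained 10 events/s with bursts of 20.
limiter := rate.NewLimiter(rate.Limit(10), 20) // golang.org/x/time/rate

// Inside the select loop:
case resp, ok := <-headerCh:
	if !ok {
		return errors.New("header subscription closed")
	}
	if err := limiter.Wait(subCtx); err != nil {
		return err // context cancelled while throttled
	}
	if err := s.processSubscriptionResponse(resp); err != nil {
		s.logger.Error().Err(err).Uint64("height", resp.Height).Msg("failed to process header subscription")
	}
```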
14. Subscription Context Lifecycle

The subscription context is cancelled when the function returns, but there's no explicit tracking of subscription goroutines:

```go
subCtx, cancel := context.WithCancel(s.ctx)
defer cancel()
```

Status: ✅ This is correct - cancelling the context will cause the subscription channels to close.

Performance Considerations

15. Subscription Response Processing

Each subscription event processes blobs synchronously:

```go
if err := s.processSubscriptionResponse(resp); err != nil {
s.logger.Error().Err(err).Uint64("height", resp.Height).Msg("failed to process header subscription")
}
```

Issue: Large blob batches could block the subscription loop, preventing other subscriptions from being processed.

Recommendation: Consider processing blobs in a worker pool or at least with a timeout; see the sketch below.
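A hedged sketch of the worker-pool variant; pool size and buffer length are illustrative:

```go
// A small worker pool drains subscription events so slow blob
// processing cannot stall the select loop.
work := make(chan *blobrpc.SubscriptionResponse, 64)
for i := 0; i < 4; i++ {
	go func() {
		for resp := range work {
			if err := s.processSubscriptionResponse(resp); err != nil {
				s.logger.Error().Err(err).Uint64("height", resp.Height).Msg("failed to process subscription event")
			}
		}
	}()
}
// The select loop then enqueues instead of processing inline:
//   case resp, ok := <-headerCh: ... ; work <- resp
```

Note that a pool processes events out of order, so the pending-event cache path would need to tolerate out-of-order heights.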
// ...
events = append(events, event)Issue: A single DA height with thousands of blocks could cause memory spikes. Recommendation: Process and send events incrementally rather than accumulating all events before sending. Architecture & Design17. Excellent Separation of Concerns ✅The PR correctly separates:
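A hedged sketch of the incremental shape, assuming a callback-based variant of ProcessBlobs; the signature change, the daRetriever receiver name, and decodeBlob are hypothetical:

```go
// Hypothetical callback-based variant: emit each event as soon as it is
// decoded instead of accumulating a slice first.
func (r *daRetriever) processBlobsStream(
	ctx context.Context,
	blobs [][]byte,
	daHeight uint64,
	emit func(common.DAHeightEvent) error,
) error {
	for _, blob := range blobs {
		event, ok := r.decodeBlob(ctx, blob, daHeight) // hypothetical helper
		if !ok {
			continue
		}
		if err := emit(event); err != nil {
			return err
		}
	}
	return nil
}
```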
Architecture & Design

17. Excellent Separation of Concerns ✅

The PR correctly separates mode determination, subscription handling, and blob retrieval/processing. This makes the code testable and maintainable.

18. Good Interface Design ✅

Adding Subscribe and LocalHead to the Client interface:

```go
type Client interface {
	// ...
	Subscribe(ctx context.Context, namespace []byte) (<-chan *blobrpc.SubscriptionResponse, error)
	LocalHead(ctx context.Context) (uint64, error)
}
```

This maintains testability and allows for different DA implementations.
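For instance, tests can satisfy the excerpted methods with a small fake (a sketch; only the two new methods are shown):

```go
// fakeDAClient is a test double for the Client interface excerpt above.
type fakeDAClient struct {
	head uint64
	subs chan *blobrpc.SubscriptionResponse
}

func (f *fakeDAClient) Subscribe(ctx context.Context, namespace []byte) (<-chan *blobrpc.SubscriptionResponse, error) {
	return f.subs, nil
}

func (f *fakeDAClient) LocalHead(ctx context.Context) (uint64, error) {
	return f.head, nil
}
```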
19. Metrics Coverage is Good ✅

The new metrics (e.g., ModeSwitches and SubscribeErrors) provide good observability.

Recommendation: Consider adding further metrics, such as a gauge for the current sync mode.
Testing Recommendations

20. Missing Test Scenarios

The PR adds the new sync-mode machinery, but scenarios such as mode switching, subscription closure, and watchdog fallback are not exercised end to end.

Recommendation: Add integration tests covering these scenarios.

Documentation

21. Well-Documented Constants ✅

The constants have clear documentation:

```go
// catchupThreshold is the number of DA blocks behind local head
// before switching from follow to catchup mode.
catchupThreshold = 2
```

22. Missing Package-Level Documentation

The new sync mode behavior isn't documented at the package level.

Recommendation: Add a package comment or design doc explaining the two modes, when transitions occur, and the watchdog behavior; a sketch follows.
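For example, a package comment along these lines (wording is a suggestion, not taken from the PR):

```go
// Package syncing synchronizes blocks from the DA layer using two modes:
//
//   - catchup: aggressive polling, used while the node trails the local
//     DA head by more than catchupThreshold blocks.
//   - follow: namespace subscriptions, used at the head of the chain and
//     guarded by a watchdog of followWatchdogMultiplier DA block times.
//
// The DA worker loop re-evaluates the mode on every iteration and
// switches between the two as the gap to the local DA head changes.
package syncing
```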
Summary & Recommendations

Must Fix Before Merge (Critical)
- #1: Loop variable address capture in processSubscriptionResponse

Should Fix Before Merge (High Priority)
- #7: Test coverage for subscription handling, mode switching, and error paths
- #10: A working Subscribe implementation in the LocalDA test harness

Consider for Follow-Up
- Rate limiting on subscription events (#13)
- Incremental event processing in ProcessBlobs (#16)
- Package-level documentation (#22)

Positive Highlights
- Clean separation of concerns (#17)
- Good interface design (#18)
- Solid metrics coverage (#19)

Overall Assessment

This is a well-architected feature that significantly improves DA synchronization efficiency. The dual-mode approach is sound, and the implementation demonstrates good understanding of concurrency patterns.

Primary concerns: the critical loop variable capture bug and the low test coverage on new code paths.

Once these issues are addressed, this will be a solid addition to the codebase.

Recommendation: Request changes for the critical loop variable issue and test coverage improvements.
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Summary of Changes

Hello @tac0turtle, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data availability (DA) worker's synchronization capabilities by introducing dynamic sync modes. It allows the system to efficiently transition between an aggressive polling strategy when catching up to the DA chain and a more resource-optimized, subscription-based approach once it reaches the chain head. This change aims to improve the overall performance and responsiveness of DA data processing.
Code Review
This pull request introduces a dual-mode synchronization mechanism, allowing the node to switch between an aggressive polling 'catchup' mode and a more efficient subscription-based 'follow' mode. This is a significant enhancement for nodes that are at the head of the chain. The changes are well-structured, introducing new DA client methods, metrics, and the core state machine logic in the daWorkerLoop. My review identified two critical bugs related to incorrect loop variable capturing that could lead to data corruption, and a couple of medium-severity design and style issues. Once these points are addressed, the implementation will be much more robust.
…in daRetriever; update tests and syncer for subscription handling
Codecov Report

❌ Patch coverage is 36.25% with 109 lines in your changes missing coverage.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2961      +/-   ##
==========================================
- Coverage   57.77%   57.32%   -0.45%
==========================================
  Files          98       98
  Lines        9394     9549     +155
==========================================
+ Hits         5427     5474      +47
- Misses       3364     3471     +107
- Partials      603      604       +1
```

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
Force-pushed from 95aeea4 to ecfcf83 (Compare)
CI is not so glad.
fixed
```go
	}

	// Subscribe to forced inclusion namespace if configured
	var forcedInclusionCh <-chan *blobrpc.SubscriptionResponse
```
We don't need to follow the forced inclusion namespace; the retriever already does its own caching. Maybe we should align this logic in the forced inclusion retriever as well, instead of using the async block fetching (in the DA layer).
```go
		s.logger.Error().Err(err).Uint64("height", resp.Height).Msg("failed to process data subscription")
	}

case resp, ok := <-forcedInclusionCh:
```
ditto, this is dead code
Overview

This PR adds sync modes, allowing us to move to subscriptions when at the head of the chain.